FUNCTIONAL  BREGMAN  DIVERGENCE 




Functional  Bregman  Divergence  and  Bayesian 
Estimation  of  Distributions 

Abstract: A class of distortions termed functional Bregman
divergences is defined, which includes squared error and relative
entropy. A functional Bregman divergence acts on functions or
distributions, and generalizes the standard Bregman divergence
for vectors and a previous pointwise Bregman divergence that
was defined for functions. A recently published result showed
that the mean minimizes the expected Bregman divergence. The
new functional definition enables the extension of this result
to the continuous case to show that the mean minimizes the
expected functional Bregman divergence over a set of functions
or distributions. It is shown how this theorem applies to the
Bayesian estimation of distributions. Estimation of the uniform
distribution from independent and identically drawn samples is
used as a case study.

Index Terms: Bregman divergence, Bayesian estimation, uniform
distribution, learning

Bregman divergences are a useful set of distortion functions
that include squared error, relative entropy, logistic
loss, Mahalanobis distance, and the Itakura-Saito function.
Bregman divergences are popular in statistical estimation and
information theory. Analysis using the concept of Bregman
divergences has played a key role in recent advances in statistical
learning [1]-[10], clustering [11], [12], inverse problems [13],
maximum entropy estimation [14], and the applicability of the
data processing theorem [15]. Recently, it was discovered that
the mean is the minimizer of the expected Bregman divergence
for a set of d-dimensional points [11], [16].

In this paper we define a functional Bregman divergence that
applies to functions and distributions, and we show that this
new definition is equivalent to Bregman divergence applied
to vectors. The functional definition generalizes a pointwise
Bregman divergence that has been previously defined for
measurable functions [7], [17], and thus extends the class of
distortion functions that are Bregman divergences; see Section
I-A.2 for an example. Most importantly, the functional definition
enables one to solve functional minimization problems
using standard methods from the calculus of variations; we
extend the recent result on the expectation of vector Bregman
divergence [11], [16] to show that the mean minimizes
the expected Bregman divergence for a set of functions or
distributions. We show how this theorem links to Bayesian
estimation of distributions. For distributions from the exponential
family, many popular divergences, such as relative entropy,
can be expressed as a (different) Bregman
divergence on the exponential distribution parameters. The
functional Bregman definition enables stronger results and a
more general application.

In Section I we state a functional definition of the Bregman
divergence and give examples for total squared difference,
relative entropy, and squared bias. In later subsections, the
relationship between the functional definition and previous
Bregman definitions is established, and properties are noted.
Then in Section II we present the main theorem: that the
expectation of a set of functions minimizes the expected Bregman
divergence. We discuss the application of this theorem to
Bayesian estimation, and as a case study compare different
estimates for the uniform distribution given independent and
identically drawn samples. Proofs are in the appendix. Readers
who are not familiar with functional derivatives may find
our short introduction to functional derivatives [18]
or the text by Gelfand and Fomin [19] helpful.

I.  Functional  Bregman  Divergence 

Let $(\mathbb{R}^d, \Omega, \nu)$ be a measure space, where $\nu$ is a Borel
measure and $d$ is a positive integer. Let $\phi$ be a real functional
over the normed space $L^p(\nu)$ for $1 \le p \le \infty$. Recall that the
bounded linear functional $\delta\phi[f;\cdot]$ is the Fréchet derivative of
$\phi$ at $f \in L^p(\nu)$ if

$$\phi[f+a] - \phi[f] = \Delta\phi[f;a] = \delta\phi[f;a] + \epsilon[f,a]\,\|a\|_{L^p(\nu)} \tag{1}$$

for all $a \in L^p(\nu)$, with $\epsilon[f,a] \to 0$ as $\|a\|_{L^p(\nu)} \to 0$ [19].
Then given an appropriate functional $\phi$, a functional Bregman
divergence can be defined:

Definition 1.1 (Functional Definition of Bregman Divergence).
Let $\phi: L^p(\nu) \to \mathbb{R}$ be a strictly convex, twice-continuously
Fréchet-differentiable functional. The Bregman divergence
$d_\phi: L^p(\nu) \times L^p(\nu) \to [0,\infty)$ is defined for all admissible
$f, g \in L^p(\nu)$ as

$$d_\phi[f,g] = \phi[f] - \phi[g] - \delta\phi[g; f-g], \tag{2}$$

where $\delta\phi[g;\cdot]$ is the Fréchet derivative of $\phi$ at $g$.

Here, we have used the Fréchet derivative, but the definition
(and results in this paper) can be easily extended using other
definitions of functional derivatives; a sample extension is
given in Section I-A.3.

A.  Examples 

Different choices of the functional $\phi$ lead to different
Bregman divergences. Illustrative examples are given for
squared error, squared bias, and relative entropy. Functionals
for other Bregman divergences can be derived based on these
examples, from the example functions for the discrete case
given in Table 1 of [16], and from the fact that $\phi$ is a strictly
convex functional if it has the form $\phi[g] = \int \tilde{\phi}(g(t))\,dt$,
where $\tilde{\phi}: \mathbb{R} \to \mathbb{R}$, $\tilde{\phi}$ is strictly convex, and $g$ is in some
well-defined vector space of functions [20].

1) Total Squared Difference: Let $\phi[g] = \int g^2\,d\nu$, where
$\phi: L^2(\nu) \to \mathbb{R}$, and let $g, f, a \in L^2(\nu)$. Then

$$\phi[g+a] - \phi[g] = \int (g+a)^2\,d\nu - \int g^2\,d\nu = 2\int g a\,d\nu + \int a^2\,d\nu.$$

Because

$$\frac{\int a^2\,d\nu}{\|a\|_{L^2(\nu)}} = \frac{\|a\|^2_{L^2(\nu)}}{\|a\|_{L^2(\nu)}} = \|a\|_{L^2(\nu)} \to 0$$

as $a \to 0$ in $L^2(\nu)$,

$$\delta\phi[g;a] = 2\int g a\,d\nu,$$

which is a continuous linear functional in $a$. To show that $\phi$ is
strictly convex we show that $\phi$ is strongly positive. When the
second variation $\delta^2\phi$ and the third variation $\delta^3\phi$ exist, they
are described by

$$\Delta\phi[f;a] = \delta\phi[f;a] + \tfrac{1}{2}\delta^2\phi[f;a,a] + \epsilon[f,a]\,\|a\|^2_{L^p(\nu)} \tag{3}$$

$$\phantom{\Delta\phi[f;a]} = \delta\phi[f;a] + \tfrac{1}{2}\delta^2\phi[f;a,a] + \tfrac{1}{6}\delta^3\phi[f;a,a,a] + \epsilon[f,a]\,\|a\|^3_{L^p(\nu)},$$

where $\epsilon[f,a] \to 0$ as $\|a\|_{L^p(\nu)} \to 0$. The quadratic functional
$\delta^2\phi[f;a,a]$ defined on the normed linear space $L^p(\nu)$ is
strongly positive if there exists a constant $k > 0$ such that
$\delta^2\phi[f;a,a] \ge k\,\|a\|^2_{L^p(\nu)}$ for all $a \in L^p(\nu)$. By definition of
the second Fréchet derivative,

$$\delta^2\phi[g;b,a] = \delta\phi[g+b;a] - \delta\phi[g;a] = 2\int (g+b)a\,d\nu - 2\int g a\,d\nu = 2\int b a\,d\nu.$$

Thus $\delta^2\phi[g;b,a]$ is a quadratic form, where $\delta^2\phi$ is actually
independent of $g$ and strongly positive since

$$\delta^2\phi[g;a,a] = 2\int a^2\,d\nu = 2\,\|a\|^2_{L^2(\nu)}$$

for all $a \in L^2(\nu)$, which implies that $\phi$ is strictly convex and

$$d_\phi[f,g] = \int f^2\,d\nu - \int g^2\,d\nu - 2\int g(f-g)\,d\nu = \int (f-g)^2\,d\nu = \|f-g\|^2_{L^2(\nu)}.$$

2) Squared Bias: Under definition (2), squared bias is a
Bregman divergence; this we have not previously seen noted
in the literature despite the importance of minimizing bias in
estimation [21].

Let $\phi[g] = \left(\int g\,d\nu\right)^2$, where $\phi: L^1(\nu) \to \mathbb{R}$. In this case

$$\phi[g+a] - \phi[g] = \left(\int g\,d\nu + \int a\,d\nu\right)^2 - \left(\int g\,d\nu\right)^2 = 2\int g\,d\nu \int a\,d\nu + \left(\int a\,d\nu\right)^2. \tag{4}$$

Note that $2\int g\,d\nu \int a\,d\nu$ is a continuous linear functional on
$L^1(\nu)$ and $\left(\int a\,d\nu\right)^2 \le \|a\|^2_{L^1(\nu)}$, so that

$$0 \le \frac{\left(\int a\,d\nu\right)^2}{\|a\|_{L^1(\nu)}} \le \frac{\|a\|^2_{L^1(\nu)}}{\|a\|_{L^1(\nu)}} = \|a\|_{L^1(\nu)}.$$

Thus from (4) and the definition of the Fréchet derivative,

$$\delta\phi[g;a] = 2\int g\,d\nu \int a\,d\nu.$$

By the definition of the second Fréchet derivative,

$$\delta^2\phi[g;b,a] = \delta\phi[g+b;a] - \delta\phi[g;a] = 2\int (g+b)\,d\nu \int a\,d\nu - 2\int g\,d\nu \int a\,d\nu = 2\int b\,d\nu \int a\,d\nu$$

is another quadratic form, and $\delta^2\phi$ is independent of $g$.

Then $\delta^2\phi[g;a,a]$ is strongly positive because

$$\delta^2\phi[g;a,a] = 2\left(\int a\,d\nu\right)^2 = 2\,\|a\|^2_{L^1(\nu)} \ge 0$$

for $a \in L^1(\nu)$, and thus $\phi$ is in fact strictly convex. The
Bregman divergence is thus

$$d_\phi[f,g] = \left(\int f\,d\nu\right)^2 - \left(\int g\,d\nu\right)^2 - 2\int g\,d\nu \int (f-g)\,d\nu$$

$$\phantom{d_\phi[f,g]} = \left(\int f\,d\nu\right)^2 + \left(\int g\,d\nu\right)^2 - 2\int g\,d\nu \int f\,d\nu$$

$$\phantom{d_\phi[f,g]} = \left(\int (f-g)\,d\nu\right)^2$$

$$\phantom{d_\phi[f,g]} \le \|f-g\|^2_{L^1(\nu)}.$$
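
The two examples above can be checked numerically on a grid. The following is a minimal sketch, not part of the original derivation: it assumes $\nu$ is Lebesgue measure on $[0,1]$ approximated by a midpoint rule, and the test functions `f` and `g` are hypothetical choices made only for illustration. It evaluates definition (2) with the Fréchet derivatives derived above and compares the result with the closed forms $\|f-g\|^2_{L^2(\nu)}$ and $\left(\int (f-g)\,d\nu\right)^2$.

```python
# Discretized sketch of definition (2) for two example functionals.
# Assumption: nu is Lebesgue measure on [0, 1], approximated by a midpoint rule.

def integrate(values, dx):
    """Midpoint-rule approximation of the integral of sampled values."""
    return sum(values) * dx

def bregman(phi, dphi, f, g, dx):
    """d_phi[f, g] = phi[f] - phi[g] - dphi[g; f - g] on a grid."""
    diff = [fi - gi for fi, gi in zip(f, g)]
    return phi(f, dx) - phi(g, dx) - dphi(g, diff, dx)

# Total squared difference: phi[g] = int g^2, dphi[g; a] = 2 int g a.
def phi_sq(g, dx):
    return integrate([gi * gi for gi in g], dx)

def dphi_sq(g, a, dx):
    return 2.0 * integrate([gi * ai for gi, ai in zip(g, a)], dx)

# Squared bias: phi[g] = (int g)^2, dphi[g; a] = 2 (int g)(int a).
def phi_bias(g, dx):
    return integrate(g, dx) ** 2

def dphi_bias(g, a, dx):
    return 2.0 * integrate(g, dx) * integrate(a, dx)

n = 1000
dx = 1.0 / n
grid = [(i + 0.5) * dx for i in range(n)]
f = [2.0 * x for x in grid]   # hypothetical test function f(x) = 2x
g = [1.0 for _ in grid]       # hypothetical test function g(x) = 1

d_sq = bregman(phi_sq, dphi_sq, f, g, dx)
d_bias = bregman(phi_bias, dphi_bias, f, g, dx)

# Closed forms from the text, evaluated on the same grid.
l2_sq = integrate([(fi - gi) ** 2 for fi, gi in zip(f, g)], dx)
bias_sq = integrate([fi - gi for fi, gi in zip(f, g)], dx) ** 2
```

For this particular pair, `f` and `g` have the same integral, so the squared-bias divergence vanishes even though the total squared difference does not; this illustrates that squared bias depends on the functions only through their integrals.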

3) Relative Entropy of Simple Functions: Denote by $\mathcal{S}$ the
collection of all integrable simple functions on the measure
space $(\mathbb{R}^d, \Omega, \nu)$, that is, the set of functions which can be
written as a finite linear combination of indicator functions
such that if $g \in \mathcal{S}$, $g$ can be expressed,

$$g(x) = \sum_{i=0}^{t} \alpha_i I_{T_i}; \quad \alpha_0 = 0,$$

where $I_{T_i}$ is the indicator function of the set $T_i$, $\{T_i\}_{i=1}^{t}$ is
a collection of mutually disjoint sets of finite measure and
$T_0 = \mathbb{R}^d \setminus \bigcup_{i=1}^{t} T_i$. We adopt the convention that $T_0$ is the set
on which $g$ is zero and therefore $\alpha_i \ne 0$ if $i \ne 0$.

Consider the normed vector space $(\mathcal{S}, \|\cdot\|_{L^\infty(\nu)})$ and let
$\mathcal{W}$ be the subset (not necessarily a vector subspace) of
non-negative functions in this normed space:

$$\mathcal{W} = \{g \in \mathcal{S} \text{ subject to } g \ge 0\}.$$

If $g \in \mathcal{W}$ then

$$\int_{\mathbb{R}^d} g \ln g\,d\nu = \sum_{i=1}^{t} \int_{T_i} \alpha_i \ln \alpha_i\,d\nu = \sum_{i=1}^{t} \alpha_i \ln \alpha_i\,\nu(T_i), \tag{5}$$

since $0 \ln 0 = 0$. Define the functional $\phi$ on $\mathcal{W}$,

$$\phi[g] = \int_{\mathbb{R}^d} g \ln g\,d\nu, \quad g \in \mathcal{W}. \tag{6}$$

The functional $\phi$ is not Fréchet-differentiable at $g$ because in
general it cannot be guaranteed that $g + h$ is non-negative on
the set where $g = 0$ for all perturbing functions $h$ in the
underlying normed vector space $(\mathcal{S}, \|\cdot\|_{L^\infty(\nu)})$ with norm
smaller than any prescribed $\epsilon > 0$. However, a generalized
Gâteaux derivative can be defined if we limit the perturbing
function $h$ to a vector subspace. To that end, let $\mathcal{G}$ be the
subspace of $(\mathcal{S}, \|\cdot\|_{L^\infty(\nu)})$ defined by

$$\mathcal{G} = \{f \in \mathcal{S} \text{ subject to } f\,d\nu \ll g\,d\nu\}.$$

It is straightforward to show that $\mathcal{G}$ is a vector space. We define
the generalized Gâteaux derivative of $\phi$ at $g \in \mathcal{W}$ to be the
linear operator $\delta_G\phi[g;\cdot]$ if

$$\lim_{\substack{\|h\|_{L^\infty(\nu)} \to 0 \\ h \in \mathcal{G}}} \frac{\left|\phi[g+h] - \phi[g] - \delta_G\phi[g;h]\right|}{\|h\|_{L^\infty(\nu)}} = 0. \tag{7}$$

Note that $\delta_G\phi[g;\cdot]$ is not linear in general, but it is on the
vector space $\mathcal{G}$. In general, if $\mathcal{G}$ is the entire underlying vector
space then (7) is the Fréchet derivative, and if $\mathcal{G}$ is the span
of only one element from the underlying vector space then
(7) is the Gâteaux derivative. Here, we have generalized the
Gâteaux derivative for the present case that $\mathcal{G}$ is a subspace
of the underlying vector space.

It remains to be shown that given the functional (6), the
derivative (7) exists and yields a Bregman divergence corresponding
to the usual notion of relative entropy. Consider the
possible solution

$$\delta_G\phi[g;h] = \int_{\mathbb{R}^d} (1 + \ln g)\,h\,d\nu, \tag{8}$$

which coupled with (6) does yield relative entropy. It remains
to be shown only that (8) satisfies (7). Note that

$$\phi[g+h] - \phi[g] - \delta_G\phi[g;h] = \int_{\mathbb{R}^d} (h+g)\ln\frac{h+g}{g} - h\,d\nu = \int_{E} (h+g)\ln\frac{h+g}{g} - h\,d\nu, \tag{9}$$

where $E$ is the set on which $g$ is not zero.

Because $g \in \mathcal{W}$, there are $m, M > 0$ such that $m \le g \le M$
on $E$. Let $h \in \mathcal{G}$ be such that $\|h\|_{L^\infty(\nu)} < m$, then $g + h \ge 0$.
Our goal is to show that the expression

$$\frac{\phi[g+h] - \phi[g] - \delta_G\phi[g;h]}{\|h\|_{L^\infty(\nu)}} \tag{10}$$

is non-negative and that it is bounded above by a bound
that goes to 0 as $\|h\|_{L^\infty(\nu)} \to 0$. We start by bounding the
integrand from above using the inequality $\ln x \le x - 1$:

$$(h+g)\ln\frac{h+g}{g} - h \le (h+g)\frac{h}{g} - h = \frac{h^2}{g}.$$

Then since $h^2/g \le \|h\|^2_{L^\infty(\nu)}/m$,

$$\frac{\phi[g+h] - \phi[g] - \delta_G\phi[g;h]}{\|h\|_{L^\infty(\nu)}} \le \frac{1}{\|h\|_{L^\infty(\nu)}} \int_E \frac{h^2}{g}\,d\nu \le \frac{\nu(E)}{m}\,\|h\|_{L^\infty(\nu)}.$$

Since $g$ is integrable, $\nu(E) < \infty$ and the right-hand side goes
to 0 as $\|h\|_{L^\infty(\nu)} \to 0$.

Next, in order to show that (10) is non-negative we have
to prove that the integral (9) is not negative. To do so, we
normalize the measure and apply Jensen's inequality. Take the
first term of the integrand of (9),

$$\int_E (h+g)\ln\frac{h+g}{g}\,d\nu = \int_E \frac{h+g}{g}\ln\frac{h+g}{g}\,g\,d\nu$$

$$= \|g\|_{L^1(\nu)} \int_E \frac{h+g}{g}\ln\frac{h+g}{g}\,\frac{g}{\|g\|_{L^1(\nu)}}\,d\nu$$

$$= \|g\|_{L^1(\nu)} \int_E \lambda\!\left(\frac{h+g}{g}\right) d\tilde{\nu},$$

where the normalized measure $d\tilde{\nu} = \frac{g}{\|g\|_{L^1(\nu)}}\,d\nu$ is a probability
measure and $\lambda(x) = x \ln x$ is a convex function on $(0,\infty)$.
By Jensen's inequality and then changing the measure back to
$d\nu$,

$$\|g\|_{L^1(\nu)} \int_E \lambda\!\left(\frac{h+g}{g}\right) d\tilde{\nu} \ge \|g\|_{L^1(\nu)}\,\lambda\!\left(\int_E \frac{h+g}{g}\,d\tilde{\nu}\right)$$

$$= \|g\|_{L^1(\nu)}\,\lambda\!\left(\frac{\|g+h\|_{L^1(\nu)}}{\|g\|_{L^1(\nu)}}\right)$$

$$= \|g+h\|_{L^1(\nu)} \ln\frac{\|g+h\|_{L^1(\nu)}}{\|g\|_{L^1(\nu)}}$$

$$\ge \|g+h\|_{L^1(\nu)} \left(1 - \frac{\|g\|_{L^1(\nu)}}{\|g+h\|_{L^1(\nu)}}\right)$$

$$= \|g+h\|_{L^1(\nu)} - \|g\|_{L^1(\nu)} = \int_E h\,d\nu,$$

where we used the fact that $\ln\frac{1}{x} \ge 1 - x$ for all $x > 0$. By
combining these two latest results we find that

$$\int_E (h+g)\ln\frac{h+g}{g}\,d\nu \ge \int_E h\,d\nu,$$

so equivalently (9) is always non-negative. This fact also
confirms that the resulting relative entropy $d_\phi[f,g]$ is always
non-negative, because (9) is $d_\phi[f,g]$ if one sets $h = f - g$.

Lastly, one must show that the functional defined in (6) is
strictly convex. Again we will show this by showing that the
second variation of $\phi$ is strongly positive. Let $f \in \mathcal{G}$ and
$\|f\|_{L^\infty(\nu)} < m$. Using the Taylor expansion of $\ln$ one can
express,

$$\delta_G\phi[g+f;h] - \delta_G\phi[g;h] = \int_E h \ln\!\left(1 + \frac{f}{g}\right) d\nu = \int_E \frac{hf}{g}\,d\nu + \epsilon[f,g]\,\|f\|_{L^\infty(\nu)},$$

where $\epsilon[f,g]$ goes to 0 as $\|f\|_{L^\infty(\nu)} \to 0$ because $|\epsilon[f,g]|$
is bounded above by a constant multiple of
$\nu(E)\,\|h\|_{L^\infty(\nu)}\,\|f\|_{L^\infty(\nu)}$.
Therefore

$$\delta^2_G\phi[g;h,f] = \int_E \frac{hf}{g}\,d\nu,$$

and

$$\delta^2_G\phi[g;h,h] = \int_E \frac{h^2}{g}\,d\nu \ge \frac{1}{M}\,\|h\|^2_{L^2(\nu)}.$$

Thus $\delta^2_G\phi[g;h,h]$ is strongly positive.
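
For simple functions on a common partition, the quantities in (5), (6) and (8) reduce to finite sums, so the resulting divergence can be evaluated directly. The sketch below is an illustration only: the set measures `nu` and the function values `f` and `g` are hypothetical numbers, chosen so that both functions are positive on every set and integrate to one. Under those assumptions the divergence built from (6) and (8) should coincide with the usual relative entropy $\int f \ln(f/g)\,d\nu$.

```python
import math

def phi(alpha, nu):
    """phi[g] = sum_i alpha_i ln(alpha_i) nu(T_i), with 0 ln 0 = 0, as in (5)."""
    return sum(a * math.log(a) * m for a, m in zip(alpha, nu) if a > 0)

def dphi(alpha_g, h, nu):
    """Candidate derivative (8): int (1 + ln g) h d nu over the set where g > 0."""
    return sum((1.0 + math.log(a)) * hi * m
               for a, hi, m in zip(alpha_g, h, nu) if a > 0)

def divergence(alpha_f, alpha_g, nu):
    """d_phi[f, g] = phi[f] - phi[g] - dphi[g; f - g], following definition (2)."""
    h = [af - ag for af, ag in zip(alpha_f, alpha_g)]
    return phi(alpha_f, nu) - phi(alpha_g, nu) - dphi(alpha_g, h, nu)

def relative_entropy(alpha_f, alpha_g, nu):
    """Usual relative entropy int f ln(f / g) d nu for positive simple functions."""
    return sum(af * math.log(af / ag) * m
               for af, ag, m in zip(alpha_f, alpha_g, nu))

nu = [0.5, 1.5, 2.0]      # hypothetical measures nu(T_1), nu(T_2), nu(T_3)
f = [0.4, 0.2, 0.25]      # values of f on T_1..T_3; integrates to 1
g = [0.2, 0.4, 0.15]      # values of g on T_1..T_3; integrates to 1

d_fg = divergence(f, g, nu)
kl_fg = relative_entropy(f, g, nu)
d_gg = divergence(g, g, nu)
```

When the two functions do not have the same integral, the same code returns the generalized form $\int f \ln(f/g) - f + g\,d\nu$, which is what (9) gives with $h = f - g$.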


B.  Relationship  to  Other  Bregman  Divergence  Definitions 

Two  propositions  establish  the  relationship  of  the  functional 
Bregman  divergence  to  other  Bregman  divergence  definitions. 

Proposition 1.2 (Functional Bregman Divergence Generalizes
Vector Bregman Divergence). The functional definition (2) is
a generalization of the standard vector Bregman divergence

$$d_{\tilde{\phi}}(x,y) = \tilde{\phi}(x) - \tilde{\phi}(y) - \nabla\tilde{\phi}(y)^T (x-y), \tag{11}$$

where $x, y \in \mathbb{R}^n$, and $\tilde{\phi}: \mathbb{R}^n \to \mathbb{R}$ is strictly convex and twice
differentiable.
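
The vector divergence (11) is straightforward to compute once a convex function and its gradient are supplied. The following is a minimal sketch with two standard generating functions; the helper names and the sample vectors `x` and `y` are hypothetical and serve only as an illustration. The squared Euclidean norm should reproduce squared error, and negative entropy should reproduce relative entropy when both vectors sum to one.

```python
import math

def bregman(phi, grad_phi, x, y):
    """Vector Bregman divergence (11): phi(x) - phi(y) - grad_phi(y)^T (x - y)."""
    inner = sum(gi * (xi - yi) for gi, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner

def sq_norm(v):
    """Squared Euclidean norm; generates squared error."""
    return sum(vi * vi for vi in v)

def grad_sq_norm(v):
    return [2.0 * vi for vi in v]

def neg_entropy(v):
    """Negative entropy on positive vectors; generates relative entropy."""
    return sum(vi * math.log(vi) for vi in v)

def grad_neg_entropy(v):
    return [1.0 + math.log(vi) for vi in v]

x = [0.2, 0.3, 0.5]   # hypothetical probability vectors
y = [0.4, 0.4, 0.2]

d_sq = bregman(sq_norm, grad_sq_norm, x, y)
d_kl = bregman(neg_entropy, grad_neg_entropy, x, y)

# Direct formulas for comparison.
squared_error = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
kl = sum(xi * math.log(xi / yi) for xi, yi in zip(x, y))
```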

Jones and Byrne describe a general class of divergences
between functions using a pointwise formulation [7]. Csiszár
specialized the pointwise formulation to a class of divergences
he termed Bregman distances $B_{s,\nu}$ [17], where given a
$\sigma$-finite measure space $(X, \Omega, \nu)$, and non-negative measurable
functions $f(x)$ and $g(x)$, $B_{s,\nu}(f,g)$ equals

$$\int s(f(x)) - s(g(x)) - s'(g(x))\,(f(x) - g(x))\,d\nu(x). \tag{12}$$

The function $s: (0,\infty) \to \mathbb{R}$ is constrained to be
differentiable and strictly convex, and the limits $\lim_{x \to 0} s(x)$
and $\lim_{x \to 0} s'(x)$ must exist, but not necessarily be finite.
The function $s$ plays a role similar to the functional $\phi$ in the
functional Bregman divergence; however, $s$ acts on the range
of the functions $f, g$, whereas $\phi$ acts on the functions $f, g$.

Proposition 1.3 (Functional Definition Generalizes Pointwise
Definition). Given a pointwise Bregman divergence as per
(12), an equivalent functional Bregman divergence can be
defined as per (2) if the measure $\nu$ is finite. However, given
a functional Bregman divergence $d_\phi[f,g]$, there is not necessarily
an equivalent pointwise Bregman divergence.


C.  Properties  of  the  Functional  Bregman  Divergence 

The Bregman divergence for random variables has some
well-known properties, as reviewed in [11, Appendix A].
Here, we note that the same properties hold for the functional
Bregman divergence (2). We give complete proofs in [18].

1. Non-negativity: The functional Bregman divergence is
non-negative: $d_\phi[f,g] \ge 0$ for all admissible inputs.

2. Convexity: The Bregman divergence $d_\phi[f,g]$ is always
convex with respect to $f$.

3. Linearity: The functional Bregman divergence is linear
such that,

$$d_{(c_1\phi_1 + c_2\phi_2)}[f,g] = c_1 d_{\phi_1}[f,g] + c_2 d_{\phi_2}[f,g].$$

4. Equivalence Classes: Partition the set of strictly convex,
differentiable functionals $\{\phi\}$ into classes such that $\phi_1$ and
$\phi_2$ belong to the same class if $d_{\phi_1}[f,g] = d_{\phi_2}[f,g]$ for
all $f, g \in \mathcal{A}$. For brevity we will denote $d_{\phi_i}[f,g]$ simply
by $d_{\phi_i}$. Let $\phi_1 \sim \phi_2$ denote that $\phi_1$ and $\phi_2$ belong to the
same class, then $\sim$ is an equivalence relation in that it
satisfies the properties of reflexivity ($d_{\phi_1} = d_{\phi_1}$), symmetry
(if $d_{\phi_1} = d_{\phi_2}$, then $d_{\phi_2} = d_{\phi_1}$), and transitivity (if $d_{\phi_1} = d_{\phi_2}$
and $d_{\phi_2} = d_{\phi_3}$, then $d_{\phi_1} = d_{\phi_3}$). Further, if $\phi_1 \sim \phi_2$, then
they differ only by an affine transformation.

5. Linear Separation: The locus of admissible functions
$f \in L^p(\nu)$ that are equidistant from two fixed functions
$g_1, g_2 \in L^p(\nu)$ in terms of functional Bregman divergence
form a hyperplane.

6. Dual Divergence: Given a pair $(g, \phi)$ where $g \in L^p(\nu)$ and
$\phi$ is a strictly convex twice-continuously Fréchet-differentiable
functional, then the function-functional pair $(G, \psi)$ is the
Legendre transform of $(g, \phi)$ [19], if

$$\phi[g] = -\psi[G] + \int g(x)G(x)\,d\nu(x), \tag{13}$$

$$\delta\phi[g;a] = \int G(x)a(x)\,d\nu(x), \tag{14}$$

where $\psi$ is a strictly convex twice-continuously Fréchet-differentiable
functional, and $G \in L^q(\nu)$, where $\frac{1}{p} + \frac{1}{q} = 1$.

Given Legendre transformation pairs $f, g \in L^p(\nu)$ and
$F, G \in L^q(\nu)$,

$$d_\phi[f,g] = d_\psi[G,F].$$

7. Generalized Pythagorean Inequality: For any admissible
$f, g, h \in L^p(\nu)$,

$$d_\phi[f,h] = d_\phi[f,g] + d_\phi[g,h] + \delta\phi[g; f-g] - \delta\phi[h; f-g].$$
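
Several of these properties can be spot-checked in the finite-dimensional setting of (11), where the Fréchet derivative is an inner product with the gradient. The sketch below is only a numerical sanity check under that assumption, not a proof; the vectors `f`, `g`, `h` and the weights `c1`, `c2` are hypothetical values. It tests linearity (property 3) and the generalized Pythagorean relation (property 7) for the negative-entropy generator.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def bregman(phi, grad, x, y):
    """phi(x) - phi(y) - grad(y)^T (x - y), the vector analogue of (2)."""
    return phi(x) - phi(y) - dot(grad(y), [xi - yi for xi, yi in zip(x, y)])

def neg_entropy(v):
    return sum(vi * math.log(vi) for vi in v)

def grad_neg_entropy(v):
    return [1.0 + math.log(vi) for vi in v]

def sq_norm(v):
    return sum(vi * vi for vi in v)

def grad_sq_norm(v):
    return [2.0 * vi for vi in v]

c1, c2 = 0.7, 2.5   # hypothetical positive weights

def mix(v):
    return c1 * neg_entropy(v) + c2 * sq_norm(v)

def grad_mix(v):
    return [c1 * a + c2 * b for a, b in zip(grad_neg_entropy(v), grad_sq_norm(v))]

f = [0.2, 0.3, 0.5]   # hypothetical positive vectors
g = [0.4, 0.4, 0.2]
h = [0.1, 0.6, 0.3]

# Property 3 (linearity in the generating functional).
lhs_linear = bregman(mix, grad_mix, f, g)
rhs_linear = (c1 * bregman(neg_entropy, grad_neg_entropy, f, g)
              + c2 * bregman(sq_norm, grad_sq_norm, f, g))

# Property 7 (generalized Pythagorean relation).
f_minus_g = [fi - gi for fi, gi in zip(f, g)]
lhs_pyth = bregman(neg_entropy, grad_neg_entropy, f, h)
rhs_pyth = (bregman(neg_entropy, grad_neg_entropy, f, g)
            + bregman(neg_entropy, grad_neg_entropy, g, h)
            + dot(grad_neg_entropy(g), f_minus_g)
            - dot(grad_neg_entropy(h), f_minus_g))
```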

II.  Minimum  Expected  Bregman  Divergence 

Consider two sets of functions (or distributions), $\mathcal{M}$ and
$\mathcal{A}$. Let $F \in \mathcal{M}$ be a random function with realization $f$.
Suppose there exists a probability distribution $P_F$ over the
set $\mathcal{M}$, such that $P_F(f)$ is the probability of $f \in \mathcal{M}$.
For example, consider the set of Gaussian distributions, and
given samples drawn independently and identically from a
randomly selected Gaussian distribution $\mathcal{N}$, the data imply
a posterior probability $P_{\mathcal{N}}(\mathcal{N})$ for each possible generating
realization of a Gaussian distribution $\mathcal{N}$. The goal is to find
the function $g^* \in \mathcal{A}$ that minimizes the expected Bregman
divergence between the random function $F$ and any function
$g \in \mathcal{A}$. The following theorem shows that if the set of possible
minimizers $\mathcal{A}$ includes $E_{P_F}[F]$, then $g^* = E_{P_F}[F]$ minimizes
the expectation of any Bregman divergence. Note the theorem
requires slightly stronger conditions on $\phi$ than the definition
of the Bregman divergence (2) requires.
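
The claim that the mean minimizes the expected divergence can be illustrated with a finite sample of discretized functions. The sketch below makes several assumptions that are not in the text: the random functions are hypothetical positive vectors on a 5-point grid, the distribution $P_F$ is replaced by the empirical distribution of 200 draws, and the divergence is the discretized relative-entropy form $\sum f \ln(f/g) - f + g$. For an empirical distribution, the excess risk of any candidate `g` over the sample mean equals the divergence from the mean to `g`, which is what the code checks.

```python
import math
import random

def gen_kl(f, g):
    """Discretized relative-entropy Bregman divergence: sum f ln(f/g) - f + g."""
    return sum(fi * math.log(fi / gi) - fi + gi for fi, gi in zip(f, g))

def expected_divergence(samples, g):
    """Empirical average of d[f, g] over the sampled functions f."""
    return sum(gen_kl(f, g) for f in samples) / len(samples)

rng = random.Random(0)

# Hypothetical random functions: 200 positive functions on a 5-point grid.
samples = [[rng.uniform(0.1, 2.0) for _ in range(5)] for _ in range(200)]
mean_f = [sum(col) / len(samples) for col in zip(*samples)]

risk_at_mean = expected_divergence(samples, mean_f)

# Smallest excess risk over 100 randomly perturbed candidates g.
smallest_gap = min(
    expected_divergence(samples, [m * rng.uniform(0.5, 1.5) for m in mean_f])
    - risk_at_mean
    for _ in range(100)
)

# Excess risk of a fixed candidate equals the divergence from the mean to it.
g_test = [1.0] * 5
excess = expected_divergence(samples, g_test) - risk_at_mean
```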

Theorem II.1 (Minimizer of the Expected Bregman Divergence).
Let $\delta^2\phi[f;a,a]$ be strongly positive and let
$\phi \in C^3(L^1(\nu); \mathbb{R})$ be a three-times continuously
Fréchet-differentiable functional on $L^1(\nu)$. Let $\mathcal{M}$ be a set of functions
that lie on a manifold $M$, and have associated measure
$dM$ such that integration is well-defined. Suppose there is a
probability distribution $P_F$ defined over the set $\mathcal{M}$. Let $\mathcal{A}$ be
a set of functions that includes $E_{P_F}[F]$ if it exists. Suppose
the function $g^*$ minimizes the expected Bregman divergence
between the random function $F$ and any function $g \in \mathcal{A}$ such
that

$$g^* = \arg\inf_{g \in \mathcal{A}} E_{P_F}[d_\phi(F,g)].$$

Then, if $g^*$ exists, it is given by

$$g^* = E_{P_F}[F]. \tag{15}$$


A. Bayesian Estimation

Theorem II.1 can be applied to a set of distributions to
find the Bayesian estimate of a distribution given a posterior
or likelihood. For parametric distributions parameterized by
$\theta \in \mathbb{R}^n$, a probability measure $\Lambda(\theta)$, and some risk function
$R(\theta,\psi)$, $\psi \in \mathbb{R}^n$, the Bayes estimator is defined [22] as

$$\hat{\theta} = \arg\inf_{\psi \in \mathbb{R}^n} \int R(\theta,\psi)\,d\Lambda(\theta). \tag{16}$$

That is, the Bayes estimator minimizes some expected risk in
terms of the parameters. It follows from recent results [16]
that $\hat{\theta} = E[\Theta]$ if the risk $R$ is a Bregman divergence, where
$\Theta$ is the random variable whose realization is $\theta$; this property
has been previously noted [8], [10].

The principle of Bayesian estimation can be applied to the
distributions themselves rather than to the parameters:

$$\hat{g} = \arg\inf_{g \in \mathcal{A}} \int_M R(f,g)\,P_F(f)\,dM, \tag{17}$$

where $P_F(f)$ is a probability measure on the distributions $f \in \mathcal{M}$,
$dM$ is a measure for the manifold $M$, and $\mathcal{A}$ is either
the space of all distributions or a subset of the space of all
distributions, such as the set $\mathcal{M}$. When the set $\mathcal{A}$ includes
the distribution $E_{P_F}[F]$ and the risk function $R$ in (17) is a
functional Bregman divergence, then Theorem II.1 establishes
that $\hat{g} = E_{P_F}[F]$.

For example, in recent work, two of the authors derived the
mean class posterior distribution for each class for a Bayesian
quadratic discriminant analysis classifier, and showed that
the classification results were superior to parameter-based
Bayesian quadratic discriminant analysis [23].

Of particular interest for estimation problems are the Bregman
divergence examples given in Section I-A: total squared
difference (mean squared error) is a popular risk function in
regression [21]; minimizing relative entropy leads to useful
theorems for large deviations and other statistical subfields
[24]; and analyzing bias is a common approach to characterizing
and understanding statistical learning algorithms [21].

B. Case Study: Estimating a Scaled Uniform Distribution

As an illustration of the theorem, we present and compare
different estimates of a scaled uniform distribution given
independent and identically drawn samples. Let the set of
uniform distributions over $[0,\theta]$ for $\theta \in \mathbb{R}^+$ be denoted
by $\mathcal{U}$. Given independent and identically distributed samples
$X_1, X_2, \ldots, X_n$ drawn from an unknown uniform distribution
$f \in \mathcal{U}$, the generating distribution is to be estimated. The risk
function $R$ is taken to be squared error or total squared error
depending on context.

1) Bayesian Parameter Estimate: Depending on the choice
of the probability measure $\Lambda(\theta)$, the integral (16) may not be
finite; for example, using the likelihood of $\theta$ with Lebesgue
measure the integral is not finite. A standard solution is to use
a gamma prior on $\theta$ and Lebesgue measure. Let $\Theta$ be a random
parameter with realization $\theta$, let the gamma distribution have
parameters $t_1$ and $t_2$, and denote the maximum of the data as
$X_{\max} = \max\{X_1, X_2, \ldots, X_n\}$. Then a Bayesian estimate is
formulated [22, p. 240, 285]:

$$E[\Theta \mid \{X_1, X_2, \ldots, X_n\}, t_1, t_2] = \frac{\int_{X_{\max}}^{\infty} \theta\,\frac{1}{\theta^{n+t_1+1}}\,e^{-\frac{1}{\theta t_2}}\,d\theta}{\int_{X_{\max}}^{\infty} \frac{1}{\theta^{n+t_1+1}}\,e^{-\frac{1}{\theta t_2}}\,d\theta}. \tag{18}$$

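The integrals can be expressed in terms of the chi-squared
random variable $\chi^2_v$ with $v$ degrees of freedom:

$$E[\Theta \mid \{X_1, X_2, \ldots, X_n\}, t_1, t_2] = \frac{1}{t_2(n+t_1-1)}\,\frac{P\!\left(\chi^2_{2(n+t_1-1)} < \frac{2}{t_2 X_{\max}}\right)}{P\!\left(\chi^2_{2(n+t_1)} < \frac{2}{t_2 X_{\max}}\right)}. \tag{19}$$

Note that (16) presupposes that the best solution is also a
uniform distribution.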
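
The step from a ratio of integrals to a ratio of chi-squared probabilities can be checked numerically. The sketch below rests on stated assumptions: the posterior mean is taken to be the ratio of $\int_{X_{\max}}^{\infty} \theta^{-(n+t_1)} e^{-1/(\theta t_2)}\,d\theta$ to $\int_{X_{\max}}^{\infty} \theta^{-(n+t_1+1)} e^{-1/(\theta t_2)}\,d\theta$, the substitution $u = 1/\theta$ is used to obtain finite integration limits, and $t_1$ is restricted to an integer so that the chi-squared distribution has an even number of degrees of freedom and a closed-form CDF. The values of `n`, `t1`, `t2` and `x_max` are hypothetical.

```python
import math

def simpson(func, a, b, m=2000):
    """Composite Simpson rule with m (even) subintervals."""
    h = (b - a) / m
    total = func(a) + func(b)
    for i in range(1, m):
        total += (4 if i % 2 else 2) * func(a + i * h)
    return total * h / 3.0

def chi2_cdf_even(x, dof):
    """P(chi^2_dof < x) for even dof, via the closed-form Erlang sum."""
    k = dof // 2
    half = x / 2.0
    return 1.0 - math.exp(-half) * sum(half ** j / math.factorial(j)
                                       for j in range(k))

# Hypothetical values; t1 is an integer so that the degrees of freedom are even.
n, t1, t2, x_max = 5, 2, 1.5, 0.8

# After u = 1/theta, int_{x_max}^inf theta^(-k) e^(-1/(theta t2)) d theta
# becomes int_0^{1/x_max} u^(k-2) e^(-u/t2) du.
upper = 1.0 / x_max
num = simpson(lambda u: u ** (n + t1 - 2) * math.exp(-u / t2), 0.0, upper)
den = simpson(lambda u: u ** (n + t1 - 1) * math.exp(-u / t2), 0.0, upper)
estimate_quadrature = num / den

# Chi-squared form: prefactor 1 / (t2 (n + t1 - 1)) times a ratio of CDF values.
c = 2.0 / (t2 * x_max)
estimate_chi2 = (chi2_cdf_even(c, 2 * (n + t1 - 1))
                 / chi2_cdf_even(c, 2 * (n + t1))) / (t2 * (n + t1 - 1))
```

Because the posterior is supported on $\theta \ge X_{\max}$, the resulting estimate should exceed `x_max`, which gives a second, independent sanity check.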

2) Bayesian Uniform Distribution Estimate: If one restricts
the minimizer of (17) to be a uniform distribution, then (17) is
solved with $\mathcal{A} = \mathcal{U}$. Because the set of uniform distributions
does not generally include its mean, Theorem II.1 does not apply,
and thus different Bregman divergences may give different
minimizers for (17). Let $P_F$ be the likelihood of the data (no
prior is assumed over the set $\mathcal{U}$), and use the Fisher information
metric ([25]-[27]) for $dM$. Then the solution to (17) is
the uniform distribution on $[0, 2^{1/n} X_{\max}]$. Using Lebesgue
measure instead gives a similar result: $[0, 2^{1/(n+1/2)} X_{\max}]$.
We were unable to find these estimates in the literature, and
so their derivations are presented in the appendix.
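
The Fisher-metric estimate can be reproduced numerically under a few assumptions that are spelled out here because the derivation itself is not shown in this section. The sketch assumes the risk is total squared difference, which for uniform densities on $[0,a]$ and $[0,b]$ equals $|1/a - 1/b|$; it assumes the likelihood is proportional to $a^{-n}$ for $a \ge X_{\max}$; and it assumes the Fisher information metric contributes the measure $da/a$. With the substitution $u = 1/a$ the expected risk becomes $\int_0^{1/X_{\max}} |u - 1/b|\,u^{n-1}\,du$, which is minimized by a golden-section search over `b`. The values of `n` and `x_max` are hypothetical.

```python
import math

def risk(b, n, x_max, m=4000):
    """Midpoint-rule value of int_0^{1/x_max} |u - 1/b| u^(n-1) du.

    Assumed model: with u = 1/a this is the expected total squared difference
    int_{x_max}^inf |1/a - 1/b| a^(-n) da/a (likelihood a^(-n), measure da/a).
    """
    upper = 1.0 / x_max
    h = upper / m
    total = 0.0
    for i in range(m):
        u = (i + 0.5) * h
        total += abs(u - 1.0 / b) * u ** (n - 1)
    return total * h

def golden_section_min(func, lo, hi, tol=1e-6):
    """Minimize a unimodal function on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - inv_phi * (hi - lo)
    d = lo + inv_phi * (hi - lo)
    fc, fd = func(c), func(d)
    while hi - lo > tol:
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - inv_phi * (hi - lo)
            fc = func(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + inv_phi * (hi - lo)
            fd = func(d)
    return 0.5 * (lo + hi)

n, x_max = 5, 0.8   # hypothetical sample size and sample maximum

b_hat = golden_section_min(lambda b: risk(b, n, x_max), x_max, 3.0 * x_max)
b_closed = 2.0 ** (1.0 / n) * x_max   # right endpoint stated in the text
```

Under these assumptions the minimizing $1/b$ is the median of a density proportional to $u^{n-1}$ on $[0, 1/X_{\max}]$, which is why the factor $2^{1/n}$ appears.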

3) Unrestricted Bayesian Distribution Estimate: When the
only restriction placed on the minimizer $g$ in (17) is that $g$
be a distribution, then one can apply Theorem II.1 and solve
directly for the expected distribution $E_{P_F}[F]$. Let $P_F$ be the
likelihood of the data (no prior is assumed over the set $\mathcal{U}$),
and use the Fisher information metric for $dM$. Solving (15),
noting that the uniform probability of $x$ is $f(x) = 1/a$ if
$x \le a$ and zero otherwise, and the likelihood of the $n$ drawn
points is $(1/a)^n$ if $a \ge X_{\max}$ and zero otherwise,

$$g^*(x) = \frac{\int_{\max(x, X_{\max})}^{\infty} \left(\frac{1}{a}\right)\left(\frac{1}{a^n}\right)\left(\frac{1}{a}\right) da}{\int_{X_{\max}}^{\infty} \left(\frac{1}{a^n}\right)\left(\frac{1}{a}\right) da} = \frac{n\,(X_{\max})^n}{(n+1)\,[\max(x, X_{\max})]^{n+1}}. \tag{20}$$
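
The closed form in (20) can be compared against direct numerical integration. The sketch below assumes the integrands are $a^{-(n+2)}$ in the numerator and $a^{-(n+1)}$ in the denominator (uniform density $1/a$, likelihood $a^{-n}$, and measure $da/a$), and again uses $u = 1/a$ to work on a finite interval. The function names and the values of `n` and `x_max` are hypothetical. It also checks that the estimate integrates to one, as a distribution should.

```python
def g_star(x, n, x_max):
    """Closed form (20): n x_max^n / ((n + 1) max(x, x_max)^(n + 1))."""
    return n * x_max ** n / ((n + 1) * max(x, x_max) ** (n + 1))

def midpoint(func, a, b, m=20000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / m
    return sum(func(a + (i + 0.5) * h) for i in range(m)) * h

def g_star_quadrature(x, n, x_max):
    """Ratio of integrals in (20) after the substitution u = 1/a."""
    # Numerator: int_{max(x, x_max)}^inf a^-(n+2) da = int_0^{1/max} u^n du.
    num = midpoint(lambda u: u ** n, 0.0, 1.0 / max(x, x_max))
    # Denominator: int_{x_max}^inf a^-(n+1) da = int_0^{1/x_max} u^(n-1) du.
    den = midpoint(lambda u: u ** (n - 1), 0.0, 1.0 / x_max)
    return num / den

n, x_max = 5, 0.8   # hypothetical sample size and sample maximum

flat_value = g_star(0.5, n, x_max)        # constant for x <= x_max
tail_closed = g_star(1.0, n, x_max)       # decaying tail for x > x_max
tail_numeric = g_star_quadrature(1.0, n, x_max)

# Total mass: flat part on [0, x_max] plus the tail, using v = 1/x for the tail.
tail_mass = midpoint(lambda v: g_star(1.0 / v, n, x_max) / v ** 2,
                     0.0, 1.0 / x_max)
total_mass = x_max * g_star(0.0, n, x_max) + tail_mass
```

Unlike the two uniform estimates, this estimate is flat on $[0, X_{\max}]$ and then decays polynomially, so it assigns positive density beyond any finite right endpoint.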


III.  Further  Discussion  and  Open  Questions 
We have defined a general Bregman divergence for functions
and distributions that can provide a foundation for results in
statistics, information theory and signal processing. Theorem
II.1 is important for these fields because it ties Bregman
divergences to expectation. As shown in Section II-A, Theorem
II.1 can be directly applied to distributions to show
that Bayesian distribution estimation simplifies to expectation
when the risk function is a Bregman divergence and the
minimizing distribution is unrestricted.

It is common in Bayesian estimation to interpret the prior
as representing some actual prior knowledge, but in fact prior
knowledge often is not available or is difficult to quantify.
Another approach is to use a prior to capture coarse information
from the data that may be used to stabilize the estimation
[9], [23]. In practice, priors are sometimes chosen in Bayesian
estimation to tame the tail of likelihood distributions so that
expectations will exist when they might otherwise be infinite
[22]. This mathematically convenient use of priors adds
estimation bias that may be unwarranted by prior knowledge. An
alternative to mathematically convenient priors is to formulate
the estimation problem as a minimization of an expected
Bregman divergence between the unknown distribution and
the estimated distribution, and restrict the set of distributions
that can be the minimizer to be a set for which the Bayesian
integral exists. Open questions are how such restrictions trade
off bias for reduced variance, and how to find or define an
"optimal" restricted set of distributions for this estimation
approach.

Finally, there are some results for the standard vector
Bregman divergence that have not been extended here. It has
been shown that a standard vector Bregman divergence must
be the risk function in order for the mean to be the minimizer
of an expected risk [16, Theorems 3 and 4]. The proof of that
result relies heavily on the discrete nature of the underlying
vectors, and it remains an open question as to whether a similar
result holds for the functional Bregman divergence. Another
result that has been shown for the vector case but remains
an open question in the functional case is convergence in
probability [16, Theorem 2].

Appendix: Proofs
A. Proof of Proposition 1.2

We give a constructive proof that there is a corresponding
functional Bregman divergence $d_\phi[f,g]$ for a specific choice of
$\phi: L^1(\nu) \to \mathbb{R}$, where $\nu = \sum_{i=1}^{n} \delta_{c_i}$ and $f, g \in L^1(\nu)$. Here,
$\delta_x$ denotes the Dirac measure such that all mass is concentrated
at $x$, and $\{c_1, c_2, \ldots, c_n\}$ is a collection of $n$ distinct points
in $\mathbb{R}^d$.

For any $x \in \mathbb{R}^n$, define $\phi[f] = \tilde{\phi}(x_1, x_2, \ldots, x_n)$, where
$f(c_1) = x_1, f(c_2) = x_2, \ldots, f(c_n) = x_n$. Then the difference
is

$$\Delta\phi[f;a] = \phi[f+a] - \phi[f]$$

$$= \tilde{\phi}\big((f+a)(c_1), \ldots, (f+a)(c_n)\big) - \tilde{\phi}(x_1, \ldots, x_n)$$

$$= \tilde{\phi}\big(x_1 + a(c_1), \ldots, x_n + a(c_n)\big) - \tilde{\phi}(x_1, \ldots, x_n).$$

Let $a_i$ be shorthand for $a(c_i)$, and use the Taylor expansion
for functions of several variables to yield

$$\Delta\phi[f;a] = \nabla\tilde{\phi}(x_1, \ldots, x_n)^T (a_1, \ldots, a_n) + \epsilon[f,a]\,\|a\|_{L^1}.$$

Therefore,

$$\delta\phi[f;a] = \nabla\tilde{\phi}(x_1, \ldots, x_n)^T (a_1, \ldots, a_n) = \nabla\tilde{\phi}(x)^T a,$$

where $x = (x_1, x_2, \ldots, x_n)$ and $a = (a_1, \ldots, a_n)$. Thus, from
(3), the functional Bregman divergence definition (2) for $\phi$ is
equivalent to the standard vector Bregman divergence:

$$d_\phi[f,g] = \phi[f] - \phi[g] - \delta\phi[g; f-g] = \tilde{\phi}(x) - \tilde{\phi}(y) - \nabla\tilde{\phi}(y)^T (x-y). \tag{21}$$


B.  Proof  of  Proposition  1.3 

First,  we  give  a  constructive  proof  of  the  first  part  of 
the  proposition  by  showing  that  given  a  BSt„,  there  is  an 
equivalent  functional  divergence  d Then,  the  second  part 
of  the  proposition  is  shown  by  example:  we  prove  that  the 
squared  bias  functional  Bregman  divergence  given  in  Section 
I-A.2  is  a  functional  Bregman  divergence  that  cannot  be 
defined  as  a  pointwise  Bregman  divergence. 

Note  that  the  integral  to  calculate  Bsv  is  not  always  finite. 
To  ensure  finite  we  explicitly  constrain  limx^o  s'  (x) 

and  lim;,;^o  s(x)  to  be  finite.  From  the  assumption  that  s 
is  strictly  convex,  s  must  be  continuous  on  (0,  oo).  Recall 
from  the  assumptions  that  the  measure  v  is  finite,  and  that  the 
function  s  is  differentiable  on  (0,  oo). 

Given a $B_{s,\nu}$, define the continuously differentiable function

$$\tilde{s}(x) = \begin{cases} s(x), & x \ge 0, \\ -s(-x) + 2s(0), & x < 0. \end{cases}$$


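Specify $\phi : L^\infty(\nu) \to \mathbb{R}$ as

$$\phi[f] = \int_X \tilde{s}(f(x))\, d\nu.$$

Note that if $f \ge 0$,

$$\phi[f] = \int_X s(f(x))\, d\nu.$$

Because $\tilde{s}$ is continuous on $\mathbb{R}$, $\tilde{s}(f) \in L^\infty(\nu)$ whenever $f \in L^\infty(\nu)$, so the above integrals always make sense.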

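It remains to be shown that $\delta\phi[f; \cdot]$ completes the equivalence when $f \ge 0$. For $h \in L^\infty(\nu)$,

$$\begin{aligned}
\phi[f + h] - \phi[f] &= \int_X \tilde{s}(f(x) + h(x))\, d\nu - \int_X \tilde{s}(f(x))\, d\nu \\
&= \int_X \tilde{s}(f(x) + h(x)) - \tilde{s}(f(x))\, d\nu \\
&= \int_X \tilde{s}'(f(x))h(x) + \epsilon(f(x), h(x))\, h(x)\, d\nu \\
&= \int_X s'(f(x))h(x) + \epsilon(f(x), h(x))\, h(x)\, d\nu,
\end{aligned}$$

where we used the fact that

$$\begin{aligned}
\tilde{s}(f(x) + h(x)) &= \tilde{s}(f(x)) + \big(\tilde{s}'(f(x)) + \epsilon(f(x), h(x))\big) h(x) \\
&= s(f(x)) + \big(s'(f(x)) + \epsilon(f(x), h(x))\big) h(x),
\end{aligned}$$

because $f \ge 0$. On the other hand, if $h(x) = 0$ then $\epsilon(f(x), h(x)) = 0$, and if $h(x) \neq 0$ then, by the definition of $\epsilon$ and the triangle inequality,

$$|\epsilon(f(x), h(x))| \le \left| \frac{\tilde{s}(f(x) + h(x)) - \tilde{s}(f(x))}{h(x)} \right| + |\tilde{s}'(f(x))|.$$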


Suppose $\{h_n\} \subset L^\infty(\nu)$ such that $h_n \to 0$. Then there is a measurable set $E$ such that its complement is of measure 0 and $h_n \to 0$ uniformly on $E$. There is some $N > 0$ such that for any $n > N$, $|h_n(x)| \le \epsilon$ for all $x \in E$. Without loss of generality, assume that there is some $M > 0$ such that for all $x \in E$, $|f(x)| \le M$. Since $\tilde{s}$ is continuously differentiable, there is a $K > 0$ such that $\max\{|\tilde{s}'(t)| \text{ subject to } t \in [-M - \epsilon, M + \epsilon]\} \le K$, and by the mean value theorem

$$\left| \frac{\tilde{s}(f(x) + h(x)) - \tilde{s}(f(x))}{h(x)} \right| \le K,$$

for almost all $x \in X$. Then

$$|\epsilon(f(x), h(x))| \le 2K,$$

except on a set of measure 0. The fact that $h(x) \to 0$ almost everywhere implies that $|\epsilon(f(x), h(x))| \to 0$ almost everywhere, and by Lebesgue's dominated convergence theorem, the corresponding integral goes to 0. As a result, the Fréchet derivative of $\phi$ is

$$\delta\phi[f; h] = \int_X s'(f(x))h(x)\, d\nu. \qquad (22)$$


Thus the functional Bregman divergence is equivalent to the given pointwise $B_{s,\nu}$.

We additionally note that the assumptions that $f \in L^\infty(\nu)$ and that the measure $\nu$ is finite are necessary for this proof. Counterexamples can be constructed if $f \in L^p$ or $\nu(X) = \infty$ such that the Fréchet derivative of $\phi$ does not obey (22). This concludes the first part of the proof.
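The Fréchet derivative (22) can be sanity-checked on a discretized example. The sketch below assumes a hypothetical $s(t) = t^3$ (strictly convex on $(0, \infty)$ with finite limits of $s$ and $s'$ at 0), approximates a finite measure $\nu$ on $[0, 1]$ by equal weights on a grid, and measures $h$ in the sup norm; none of these choices come from the paper.

```python
def s(t):
    return t ** 3           # hypothetical strictly convex s on (0, inf)


def s_prime(t):
    return 3.0 * t ** 2


m = 1000
xs = [(i + 0.5) / m for i in range(m)]   # grid on [0, 1]
w = 1.0 / m                              # equal weights: finite measure nu
f = [1.0 + x for x in xs]                # a bounded function with f >= 0


def phi(vals):
    """Discretized phi[f] = integral of s(f(x)) d nu."""
    return sum(s(v) for v in vals) * w


def dphi(fvals, hvals):
    """Discretized candidate derivative: integral of s'(f(x)) h(x) d nu."""
    return sum(s_prime(a) * b for a, b in zip(fvals, hvals)) * w


def remainder_ratio(scale):
    """|phi[f+h] - phi[f] - dphi[f;h]| divided by the sup norm of h."""
    h = [scale * (0.5 - x) for x in xs]
    sup_h = max(abs(v) for v in h)
    f_plus_h = [a + b for a, b in zip(f, h)]
    return abs(phi(f_plus_h) - phi(f) - dphi(f, h)) / sup_h


ratios = [remainder_ratio(sc) for sc in (1e-1, 1e-2, 1e-3)]
print(ratios)
```

The printed ratios shrink roughly in proportion to the size of $h$, which is the behaviour required of the remainder term in the definition of the Fréchet derivative.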

To show that the squared bias functional Bregman divergence given in Section I-A.2 is an example of a functional Bregman divergence that cannot be defined as a pointwise Bregman divergence, we prove that the converse statement leads to a contradiction.
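The mechanism behind the contradiction can be previewed on a toy discrete measure space. In the sketch below, the weights `nu` and `mu`, the sets $E_1$ and $E_2$, and the candidate $f(t) = 3t^2$ are all hypothetical: a squared integral produces a cross term for indicator functions of disjoint sets, whereas a pointwise integral with $f(0) = 0$ is additive over disjoint supports.

```python
def integral(values, weights):
    """Integral of a function on a finite space against a weighted measure."""
    return sum(v * w for v, w in zip(values, weights))


def squared_bias(xi, nu):
    """The left-hand side form: (integral of xi d nu)^2."""
    return integral(xi, nu) ** 2


def pointwise(xi, mu, f):
    """The right-hand side form: integral of f(xi) d mu."""
    return integral([f(v) for v in xi], mu)


nu = [0.5, 1.0, 1.5, 2.0]       # hypothetical measure nu on four points
mu = [1.0, 0.3, 0.7, 2.0]       # hypothetical measure mu on the same points
xi1 = [1.0, 1.0, 0.0, 0.0]      # indicator of E1 = {first two points}
xi2 = [0.0, 0.0, 1.0, 0.0]      # indicator of E2 = {third point}
xi = [a + b for a, b in zip(xi1, xi2)]


def f(t):
    return 3.0 * t * t           # hypothetical f with f(0) = 0


cross_bias = squared_bias(xi, nu) - squared_bias(xi1, nu) - squared_bias(xi2, nu)
cross_point = pointwise(xi, mu, f) - pointwise(xi1, mu, f) - pointwise(xi2, mu, f)
print(cross_bias, cross_point)
```

Here `cross_bias` equals $2\nu(E_1)\nu(E_2)$, which is nonzero, while `cross_point` vanishes up to rounding; the proof below turns this mismatch into a formal contradiction.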

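Suppose $(X, \Sigma, \nu)$ and $(X, \Sigma, \mu)$ are measure spaces where $\nu$ is a non-zero $\sigma$-finite measure and that there is a differentiable function $f : (0, \infty) \to \mathbb{R}$ such that

$$\left( \int \xi\, d\nu \right)^2 = \int f(\xi)\, d\mu, \qquad (23)$$

where $\xi \in L^1(\nu)$. Let $f(0) = \lim_{x \to 0} f(x)$, which can be finite or infinite, and let $\alpha$ be any real number. Then

$$\int f(\alpha\xi)\, d\mu = \left( \int \alpha\xi\, d\nu \right)^2 = \alpha^2 \left( \int \xi\, d\nu \right)^2 = \alpha^2 \int f(\xi)\, d\mu.$$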


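Because $\nu$ is $\sigma$-finite, there is a measurable set $E$ such that $0 < |\nu(E)| < \infty$. Let $X \setminus E$ denote the complement of $E$ in $X$. Then

$$\begin{aligned}
\alpha^2 \nu^2(E) &= \alpha^2 \left( \int I_E\, d\nu \right)^2 \\
&= \alpha^2 \int f(I_E)\, d\mu \\
&= \alpha^2 \int_{X \setminus E} f(0)\, d\mu + \alpha^2 \int_E f(1)\, d\mu \\
&= \alpha^2 f(0)\mu(X \setminus E) + \alpha^2 f(1)\mu(E).
\end{aligned}$$

Also,

$$\alpha^2 \nu^2(E) = \left( \int \alpha I_E\, d\nu \right)^2.$$

However,

$$\begin{aligned}
\left( \int \alpha I_E\, d\nu \right)^2 &= \int f(\alpha I_E)\, d\mu \\
&= \int_{X \setminus E} f(\alpha I_E)\, d\mu + \int_E f(\alpha I_E)\, d\mu \\
&= f(0)\mu(X \setminus E) + f(\alpha)\mu(E);
\end{aligned}$$

so one can conclude that

$$\alpha^2 f(0)\mu(X \setminus E) + \alpha^2 f(1)\mu(E) = f(0)\mu(X \setminus E) + f(\alpha)\mu(E). \qquad (24)$$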


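Apply equation (23) for $\xi = 0$ to yield

$$0 = \left( \int 0\, d\nu \right)^2 = \int f(0)\, d\mu = f(0)\mu(X).$$

Since $|\nu(E)| > 0$, $\mu(X) \neq 0$, so it must be that $f(0) = 0$, and (24) becomes

$$\alpha^2 \nu^2(E) = \alpha^2 f(1)\mu(E) = f(\alpha)\mu(E) \quad \forall \alpha \in \mathbb{R}.$$

The first equation implies that $\mu(E) \neq 0$. The second equation determines the function $f$ completely:

$$f(\alpha) = f(1)\alpha^2.$$

Then (23) becomes

$$\left( \int \xi\, d\nu \right)^2 = f(1) \int \xi^2\, d\mu.$$

Consider any two disjoint measurable sets, $E_1$ and $E_2$, with finite nonzero measure. Define $\xi_1 = I_{E_1}$ and $\xi_2 = I_{E_2}$. Take $\xi = \xi_1 + \xi_2$, and note that $\xi_1 \xi_2 = I_{E_1} I_{E_2} = 0$. Expanding both sides of (23) for $\xi$ and cancelling the terms that (23) already equates for $\xi_1$ and $\xi_2$ separately, equation (23) becomes

$$\int \xi_1\, d\nu \int \xi_2\, d\nu = f(1) \int \xi_1 \xi_2\, d\mu. \qquad (25)$$

This implies the following contradiction:

$$\int \xi_1\, d\nu \int \xi_2\, d\nu = \nu(E_1)\nu(E_2) \neq 0, \qquad (26)$$

but

$$f(1) \int \xi_1 \xi_2\, d\mu = 0. \qquad (27)$$

C. Proof of Theorem II.1

Recall that for a functional $J$ to have an extremum (minimum) at $f = \hat{f}$, it is necessary that

$$\delta J[f; a] = 0 \quad \text{and} \quad \delta^2 J[f; a, a] \ge 0,$$

for $f = \hat{f}$ and for all admissible functions $a \in \mathcal{A}$. A sufficient condition for a functional $J[f]$ to have a minimum for $f = \hat{f}$ is that the first variation $\delta J[f; a]$ must vanish for $f = \hat{f}$, and its second variation $\delta^2 J[f; a, a]$ must be strongly positive for $f = \hat{f}$.

Let

$$\begin{aligned}
J[g] &= E_{P_F}[d_\phi(F, g)] = \int_M d_\phi[f, g] P_F(f)\, dM \\
&= \int_M \big( \phi[f] - \phi[g] - \delta\phi[g; f - g] \big) P_F(f)\, dM, \qquad (28)
\end{aligned}$$

where (28) follows by substituting the definition of Bregman divergence (2). Consider the increment

$$\begin{aligned}
\Delta J[g; a] &= J[g + a] - J[g] \qquad (29) \\
&= -\int_M \big( \phi[g + a] - \phi[g] \big) P_F(f)\, dM \\
&\quad - \int_M \big( \delta\phi[g + a; f - g - a] - \delta\phi[g; f - g] \big) P_F(f)\, dM, \qquad (30)
\end{aligned}$$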


where (30) follows from substituting (28) into (29). Using the definition of the differential of a functional given in (1), the first integrand in (30) can be written as

$$\phi[g + a] - \phi[g] = \delta\phi[g; a] + \epsilon[g, a]\, \|a\|_{L^1(\nu)}. \qquad (31)$$
Take the second integrand of (30), and subtract and add $\delta\phi[g; f - g - a]$:

$$\begin{aligned}
\delta\phi[g + a; f - g - a] &- \delta\phi[g; f - g] \\
&= \delta\phi[g + a; f - g - a] - \delta\phi[g; f - g - a] + \delta\phi[g; f - g - a] - \delta\phi[g; f - g] \\
&\overset{(a)}{=} \delta^2\phi[g; f - g - a, a] + \epsilon[g, a]\, \|a\|_{L^1(\nu)} + \delta\phi[g; f - g] - \delta\phi[g; a] - \delta\phi[g; f - g] \\
&\overset{(b)}{=} \delta^2\phi[g; f - g, a] - \delta^2\phi[g; a, a] + \epsilon[g, a]\, \|a\|_{L^1(\nu)} - \delta\phi[g; a], \qquad (32)
\end{aligned}$$

where (a) follows from the linearity of the third term, and (b) follows from the linearity of the first term. Substitute (31) and (32) into (30),

$$\Delta J[g; a] = -\int_M \Big( \delta^2\phi[g; f - g, a] - \delta^2\phi[g; a, a] + \epsilon[g, a]\, \|a\|_{L^1(\nu)} \Big) P_F(f)\, dM.$$

Note that the term $\delta^2\phi[g; a, a]$ is of order $\|a\|^2_{L^1(\nu)}$, that is, $\|\delta^2\phi[g; a, a]\|_{L^1(\nu)} \le m\, \|a\|^2_{L^1(\nu)}$ for some constant $m$. Therefore,

$$\lim_{\|a\|_{L^1(\nu)} \to 0} \frac{\|J[g + a] - J[g] - \delta J[g; a]\|_{L^1(\nu)}}{\|a\|_{L^1(\nu)}} = 0,$$

where

$$\delta J[g; a] = -\int_M \delta^2\phi[g; f - g, a] P_F(f)\, dM. \qquad (33)$$

For fixed $a$, $\delta^2\phi[g; \cdot, a]$ is a bounded linear functional in the second argument, so the integration and the functional can be interchanged in (33), which becomes

$$\delta J[g; a] = -\delta^2\phi\left[ g; \int_M (f - g) P_F(f)\, dM, a \right].$$

Using the functional optimality conditions, $J[g]$ has an extremum for $g = \hat{g}$ if

$$\delta^2\phi\left[ \hat{g}; \int_M (f - \hat{g}) P_F(f)\, dM, a \right] = 0. \qquad (34)$$


Set $a = \int_M (f - \hat{g}) P_F(f)\, dM$ in (34) and use the assumption that the quadratic functional $\delta^2\phi[g; a, a]$ is strongly positive, which implies that the above functional can be zero if and only if $a = 0$, that is,

$$0 = \int_M (f - \hat{g}) P_F(f)\, dM, \qquad (35)$$

$$\hat{g} = E_{P_F}[F], \qquad (36)$$


where the last line holds if the expectation exists (i.e., if the measure is well-defined and the expectation is finite). Because a Bregman divergence is not necessarily convex in its second argument, it is not yet established that the above unique extremum is a minimum. To see that (36) is in fact a minimum of $J[g]$, from the functional optimality conditions it is enough to show that $\delta^2 J[\hat{g}; a, a]$ is strongly positive. To show this, for $b \in L^1(\nu)$, consider

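$$\begin{aligned}
\delta J[g + b; a] &- \delta J[g; a] \\
&\overset{(c)}{=} -\int_M \big( \delta^2\phi[g + b; f - g - b, a] - \delta^2\phi[g; f - g, a] \big) P_F(f)\, dM \\
&\overset{(d)}{=} -\int_M \big( \delta^2\phi[g + b; f - g - b, a] - \delta^2\phi[g; f - g - b, a] \\
&\qquad\quad + \delta^2\phi[g; f - g - b, a] - \delta^2\phi[g; f - g, a] \big) P_F(f)\, dM \\
&\overset{(e)}{=} -\int_M \big( \delta^3\phi[g; f - g - b, a, b] + \epsilon[g, a, b]\, \|b\|_{L^1(\nu)} \\
&\qquad\quad + \delta^2\phi[g; f - g, a] - \delta^2\phi[g; b, a] - \delta^2\phi[g; f - g, a] \big) P_F(f)\, dM \\
&\overset{(f)}{=} -\int_M \big( \delta^3\phi[g; f - g, a, b] - \delta^3\phi[g; b, a, b] \\
&\qquad\quad + \epsilon[g, a, b]\, \|b\|_{L^1(\nu)} - \delta^2\phi[g; b, a] \big) P_F(f)\, dM, \qquad (37)
\end{aligned}$$

where (c) follows from using integral (33); (d) from subtracting and adding $\delta^2\phi[g; f - g - b, a]$; (e) from the fact that the variation of the second variation of $\phi$ is the third variation of $\phi$ [28]; and (f) from the linearity of the first term and cancellation of the third and fifth terms. Note that in (37) for fixed $a$, the term $\delta^3\phi[g; b, a, b]$ is of order $\|b\|^2_{L^1(\nu)}$, while the first and the last terms are of order $\|b\|_{L^1(\nu)}$. Therefore,

$$\lim_{\|b\|_{L^1(\nu)} \to 0} \frac{\|\delta J[g + b; a] - \delta J[g; a] - \delta^2 J[g; a, b]\|_{L^1(\nu)}}{\|b\|_{L^1(\nu)}} = 0,$$

where

$$\delta^2 J[g; a, b] = -\int_M \delta^3\phi[g; f - g, a, b] P_F(f)\, dM + \int_M \delta^2\phi[g; a, b] P_F(f)\, dM. \qquad (38)$$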

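Substitute $b = a$, $g = \hat{g}$ and interchange integration and the continuous functional $\delta^3\phi$ in the first integral of (38), then

$$\begin{aligned}
\delta^2 J[\hat{g}; a, a] &= -\delta^3\phi\left[ \hat{g}; \int_M (f - \hat{g}) P_F(f)\, dM, a, a \right] + \int_M \delta^2\phi[\hat{g}; a, a] P_F(f)\, dM \\
&= \int_M \delta^2\phi[\hat{g}; a, a] P_F(f)\, dM \qquad (39) \\
&\ge \int_M k\, \|a\|^2_{L^1(\nu)} P_F(f)\, dM \\
&= k\, \|a\|^2_{L^1(\nu)} > 0, \qquad (40)
\end{aligned}$$

where (39) follows from (35), and (40) follows from the strong positivity of $\delta^2\phi[\hat{g}; a, a]$. Therefore, from (40) and the functional optimality conditions, $\hat{g}$ is the minimum.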
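The conclusion that the mean minimizes the expected functional Bregman divergence can be checked numerically in a simple setting. The sketch below assumes a discretized squared-error case, $\phi[f] = \int f^2\, d\nu$ on a uniform grid, and a hypothetical finite collection of equally likely random functions standing in for $P_F$; it compares $J$ at the sample mean with $J$ at randomly perturbed candidates.

```python
import random

random.seed(0)
m = 50                 # grid points discretizing the domain
w = 1.0 / m            # equal quadrature weights


def phi(f):
    """Discretized phi[f] = integral of f^2 d nu."""
    return sum(v * v for v in f) * w


def dphi(g, h):
    """Discretized Frechet derivative of phi at g in direction h."""
    return sum(2.0 * a * b for a, b in zip(g, h)) * w


def d_phi(f, g):
    """Functional Bregman divergence phi[f] - phi[g] - dphi[g; f - g]."""
    return phi(f) - phi(g) - dphi(g, [a - b for a, b in zip(f, g)])


# Hypothetical stand-in for P_F: 200 equally likely random functions.
draws = [[random.random() for _ in range(m)] for _ in range(200)]


def J(g):
    """Expected divergence from a random draw F to the candidate g."""
    return sum(d_phi(f, g) for f in draws) / len(draws)


mean = [sum(f[i] for f in draws) / len(draws) for i in range(m)]
J_mean = J(mean)

J_perturbed = []
for _ in range(20):
    a = [random.uniform(-0.5, 0.5) for _ in range(m)]
    J_perturbed.append(J([g + e for g, e in zip(mean, a)]))

print(J_mean, min(J_perturbed))
```

In this squared-error setting every perturbed candidate yields a strictly larger expected divergence than the mean. This is only an empirical illustration of the statement, not a substitute for the proof.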


D.  Derivation  of  the  Bayesian  Distribution-based  Uniform 
Estimate  Restricted  to  a  Uniform  Minimizer 

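Let $f(x) = 1/a$ for all $0 \le x \le a$ and $g(x) = 1/b$ for all $0 \le x \le b$. Assume at first that $b > a$; then the total squared difference between $f$ and $g$ is

$$\begin{aligned}
\int (f(x) - g(x))^2\, dx &= \left( \frac{1}{a} - \frac{1}{b} \right)^2 a + (b - a) \left( \frac{1}{b} \right)^2 \\
&= \frac{b - a}{ab} \\
&= \frac{|b - a|}{ab},
\end{aligned}$$

where the last line does not require the assumption that $b > a$.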
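The closed form $|b - a| / (ab)$ is easy to confirm by direct numerical integration. The following minimal sketch uses a midpoint rule; the function names and the test values $a = 2$, $b = 3$ are illustrative assumptions.

```python
def squared_difference(a, b, steps=100_000):
    """Midpoint-rule estimate of the integral of (f - g)^2, where
    f is uniform on [0, a] and g is uniform on [0, b]."""
    upper = max(a, b)
    dx = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * dx
        f = 1.0 / a if x <= a else 0.0
        g = 1.0 / b if x <= b else 0.0
        total += (f - g) ** 2 * dx
    return total


def closed_form(a, b):
    """The expression |b - a| / (a b) from the derivation."""
    return abs(b - a) / (a * b)


num_ab = squared_difference(2.0, 3.0)
num_ba = squared_difference(3.0, 2.0)
print(num_ab, num_ba, closed_form(2.0, 3.0))
```

Swapping the roles of $a$ and $b$ gives the same value, in line with the remark that the final expression does not depend on the assumption $b > a$.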

In this case, the integral (17) is over the one-dimensional manifold of uniform distributions $\mathcal{U}$; a Riemannian metric can be formed by using the differential arc element to convert Lebesgue measure on the set $\mathcal{U}$ to a measure on the set of parameters $a$ such that (17) is re-formulated in terms of the parameters for ease of calculation:

$$b^* = \arg\min_{b \in \mathbb{R}^+} \int_{a = X_{\max}}^{\infty} \frac{|b - a|}{ab}\, \frac{1}{a^n} \left\| \frac{df}{da} \right\|_2 da, \qquad (41)$$


where $1/a^n$ is the likelihood of the $n$ data points being drawn from a uniform distribution $[0, a]$, and the estimated distribution is uniform on $[0, b^*]$. The differential arc element $\left\| \frac{df}{da} \right\|_2$ can be calculated by expanding $df/da$ in terms of the Haar orthonormal basis $\left\{ \frac{1}{\sqrt{a}}, \phi_{jk}(x) \right\}$, which forms a complete orthonormal basis for the interval $0 \le x \le a$, and then the required norm is equivalent to the norm of the basis coefficients of the orthonormal expansion:

$$\left\| \frac{df}{da} \right\|_2 = \frac{1}{a^{3/2}}. \qquad (42)$$


For estimation problems, the measure determined by the Fisher information metric may be more appropriate than Lebesgue measure [25]-[27]. Then

$$dM = |I(a)|^{1/2}\, da, \qquad (43)$$

where $I$ is the Fisher information matrix. For the one-dimensional manifold $M$ formed by the set of scaled uniform distributions $\mathcal{U}$, the Fisher information matrix is

$$I(a) = E_X\left[ \left( \frac{d \log \frac{1}{a}}{da} \right)^2 \right] = \int_0^a \frac{1}{a^2}\, \frac{1}{a}\, dx = \frac{1}{a^2},$$

so that the differential element is $dM = \frac{da}{a}$.

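We solve (17) using the Lebesgue measure (42); the solution with the Fisher differential element follows the same logic. Then (41) is equivalent to $\arg\min_b J(b)$, where

$$\begin{aligned}
J(b) &= \int_{a = X_{\max}}^{\infty} \frac{|b - a|}{ab}\, \frac{da}{a^{n + 3/2}} \\
&= \int_{X_{\max}}^{b} \frac{b - a}{ab}\, \frac{da}{a^{n + 3/2}} + \int_{b}^{\infty} \frac{a - b}{ab}\, \frac{da}{a^{n + 3/2}} \\
&= \frac{2}{(n + 1/2)(n + 3/2)\, b^{n + 3/2}} - \frac{1}{b\,(n + 1/2)\, X_{\max}^{n + 1/2}} + \frac{1}{(n + 3/2)\, X_{\max}^{n + 3/2}}.
\end{aligned}$$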

The minimum is found by setting the first derivative to zero:

$$J'(b) = -\frac{2(n + 3/2)}{(n + 1/2)(n + 3/2)\, b^{n + 5/2}} + \frac{1}{b^2 (n + 1/2)\, X_{\max}^{n + 1/2}} = 0$$

$$\Rightarrow \quad b^* = 2^{\frac{1}{n + 1/2}} X_{\max}.$$
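The stationary point can be cross-checked numerically. The sketch below evaluates the closed-form $J(b)$, compares it with a midpoint-rule estimate of the defining integral (using the substitution $a = X_{\max}/u$ to map the infinite range onto $(0, 1]$), and locates the minimizer by ternary search. The values $n = 5$ and $X_{\max} = 2$, and all function names, are illustrative assumptions.

```python
def J_closed(b, n, x_max):
    """Closed-form J(b) for the Lebesgue arc element a**(-3/2)."""
    p = n + 1.5                       # exponent n + 3/2
    return (2.0 / ((p - 1.0) * p * b ** p)
            - 1.0 / (b * (p - 1.0) * x_max ** (p - 1.0))
            + 1.0 / (p * x_max ** p))


def J_numeric(b, n, x_max, steps=200_000):
    """Midpoint-rule estimate of the integral over a in [x_max, inf),
    after substituting a = x_max / u with u in (0, 1]."""
    p = n + 1.5
    du = 1.0 / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * du
        a = x_max / u
        integrand = abs(b - a) / (a * b) * a ** (-p)
        total += integrand * x_max / (u * u) * du   # da = x_max / u^2 du
    return total


def argmin_ternary(func, lo, hi, iters=200):
    """Ternary search for the minimizer of a unimodal function."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if func(m1) < func(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


n, x_max = 5, 2.0
b_star = 2.0 ** (1.0 / (n + 0.5)) * x_max
b_hat = argmin_ternary(lambda b: J_closed(b, n, x_max), x_max, 10.0 * x_max)
gap = abs(J_numeric(b_star, n, x_max) - J_closed(b_star, n, x_max))
print(b_star, b_hat, gap)
```

Ternary search is adequate here because $J'(b)$ changes sign only once on $(0, \infty)$, so $J$ is unimodal on the search interval.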

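To establish that $b^*$ is in fact a minimum, note that

$$\begin{aligned}
J''(b^*) &= \frac{2(n + 5/2)}{(n + 1/2)\, (b^*)^{n + 7/2}} - \frac{2}{(n + 1/2)\, (b^*)^3\, X_{\max}^{n + 1/2}} \\
&= \frac{2}{\left( 2^{\frac{1}{n + 1/2}} X_{\max} \right)^{n + 7/2}} > 0,
\end{aligned}$$

where the second line uses $(b^*)^{n + 1/2} = 2 X_{\max}^{n + 1/2}$.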


Thus, the restricted Bayesian estimate is the uniform distribution over $\left[ 0, 2^{\frac{1}{n + 1/2}} X_{\max} \right]$.